2023 iThome 鐵人賽

SRE/K8S 碎碎念 series, part 6

Day 6: PDB, Permission, and Deploy

While upgrading our K8S cluster from 1.23 to 1.24, one of the deploys failed with the following logs:

module.eks.aws_eks_node_group.node[0]: Still modifying... [id=xxx-prod:xxx-prod-node-1, 36m50s elapsed]
Error: error waiting for EKS Node Group (xxx-prod:xxx-prod-node-1) version update: unexpected state 'Failed', wanted target 'Successful'. last error: 1 error occurred:
	* ip-xxxx.xxxx.compute.internal: PodEvictionFailure: Reached max retries while trying to evict pods from nodes in node group xxx-prod-node-1

From the error we can see the failure happened while one of the nodes was evicting its Pods. Tracing the root cause: upgrading EKS rolls the node group, replacing old nodes with new ones, and one pod kept being asked to shut down but never terminated, so the old node could not be rotated out.
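
The same failure can also be inspected from the EKS side. A minimal sketch using the AWS CLI; the cluster and node group names are taken from the redacted log above, and the update id is a placeholder:

$ aws eks list-updates --name xxx-prod --nodegroup-name xxx-prod-node-1
$ aws eks describe-update --name xxx-prod --nodegroup-name xxx-prod-node-1 --update-id <update-id>   # status and error details for the failed update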

$ kubectl get node -A
NAME   STATUS                        ROLES    AGE    VERSION
ip-1   Ready                         <none>   111m   v1.23.17-eks-a59e1f0
ip-2   Ready                         <none>   118m   v1.23.17-eks-a59e1f0
ip-3   Ready                         <none>   114m   v1.23.17-eks-a59e1f0
ip-4   Ready                         <none>   32m    v1.24.11-eks-a59e1f0
ip-5   Ready                         <none>   121m   v1.23.17-eks-a59e1f0
ip-6   Ready                         <none>   121m   v1.23.17-eks-a59e1f0
ip-7   NotReady,SchedulingDisabled   <none>   32m    v1.24.11-eks-a59e1f0
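
At this point it helps to look at the cordoned node directly and list the pods that are still running on it. A minimal sketch; the shortened node name from the listing above stands in for the real hostname:

$ kubectl describe node ip-7                                          # node conditions and recent events
$ kubectl get pods -A -o wide --field-selector spec.nodeName=ip-7     # pods still scheduled on that node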

After a while, once the update timed out, all the newly launched 1.24 nodes were shut down again, because the stuck pod kept the old nodes from being drained and terminated.

$ kubectl get nodes -A
NAME   STATUS   ROLES    AGE    VERSION
ip-1   Ready    <none>   122m   v1.23.17-eks-a59e1f0
ip-2   Ready    <none>   130m   v1.23.17-eks-a59e1f0
ip-3   Ready    <none>   126m   v1.23.17-eks-a59e1f0
ip-5   Ready    <none>   132m   v1.23.17-eks-a59e1f0
ip-6   Ready    <none>   132m   v1.23.17-eks-a59e1f0

We then tried draining the node ourselves. kubectl drain is the command for node maintenance and upgrades in Kubernetes: when a node needs to be upgraded or serviced, kubectl drain "safely" evicts all workloads from that node, so that the running Pods get rescheduled onto other available nodes and keep serving traffic.
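
The usual flow looks like this, as a minimal sketch; the node name is a placeholder, and --delete-emptydir-data is the current name of the deprecated --delete-local-data flag used in the actual run below:

$ kubectl cordon <node-name>      # mark the node unschedulable; nothing is evicted yet
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
$ kubectl uncordon <node-name>    # put the node back into service after maintenance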

$ kubectl drain ip-10-13-4-151.ap-northeast-1.compute.internal --ignore-daemonsets --delete-local-data
node/ip-10-13-4-151.ap-northeast-1.compute.internal already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-cloudwatch-logs-aws-for-fluent-bit-88ccw, kube-system/aws-cloudwatch-metrics-qw5fx, kube-system/aws-node-57vpv, kube-system/ebs-csi-node-chvvs, kube-system/kube-proxy-zfrrh
evicting pod kube-system/coredns-xxx-5l6lf
.
.
.
evicting pod xxx-prod/xxx-xxx-api-service-deployment
error when evicting pod "cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/coredns-59847d77c8-5l6lf evicted
.
.
pod/aws-load-balancer-controller-c9cd98dd6-sbzng evicted
evicting pod kube-system/cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq
error when evicting pod "cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq
error when evicting pod "cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq
error when evicting pod "cluster-autoscaler-aws-cluster-autoscaler-7c86765cf4-bmbnq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
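
The evictions above are being rejected because they would violate a PodDisruptionBudget. To see which budget is responsible and how much disruption it still allows, the PDB objects can be listed and inspected; the describe target is a placeholder to be taken from the listing:

$ kubectl get pdb -A                               # all PodDisruptionBudgets, with ALLOWED DISRUPTIONS
$ kubectl describe pdb -n kube-system <pdb-name>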

Running drain made it obvious that the cluster autoscaler pod was the one stuck behind its disruption budget; as for the fix, we will save that for the next post.

